Digit and Simple Voice Recognizer

A simple speech pattern recognizer that uses MFCCs (Mel-Frequency Cepstral Coefficients) and Dynamic Time Warping (DTW) to match a given template speech set against a test set.

Additional library required to run this notebook:
Librosa

@edwardpassagi on GitHub

Library Imports

In [1]:
# Library Import and basic function definition
from scipy.io import wavfile

import random

import scipy
import scipy.spatial.distance as dis
import scipy.signal as signal
import numpy as np
import IPython.display as ipd
import librosa

# Progress Visualization
from tqdm import tqdm
from tqdm.auto import tqdm, trange

# Ignore MFCC warning due to wavfile tag
import warnings
warnings.filterwarnings('ignore')

# Print Sound
def sound( x, rate=8000, label=''):
    from IPython.display import display, Audio, HTML
    if label == '':
        display( Audio( x, rate=rate))
    else:
        display( HTML( 
        '<style> table, th, td {border: 0px; }</style> <table><tr><td>' + label + 
        '</td><td>' + Audio( x, rate=rate)._repr_html_()[3:] + '</td></tr></table>'
        ))

1. Data Preparation

Audio Import and MFCC transformation

Since I'll be comparing the MFCC data of each audio file (template and test), we need both the raw WAV samples and the MFCC representation of every file.

In [2]:
# sr = 44100
sr = wavfile.read("./digits_samples/template.wav")[0]

# take L channel
template = np.array(wavfile.read("./digits_samples/template.wav")[1][:,0], dtype=float)
test = np.array(wavfile.read("./digits_samples/test.wav")[1][:,0], dtype=float)

# find MFCC for both sets
templateMFCC = librosa.feature.mfcc(y=template, sr=sr, n_mfcc=50)
testMFCC = librosa.feature.mfcc(y=test, sr=sr, n_mfcc=50)

Parse the dataset into individual digits

In [3]:
# parse template to 10 MFCC and 10 digits
tempMFList = []
tempDigs = np.array(np.array_split(template,10))

# parse testing to 110 MFCCs and 110 digits
testMFList = []
testDigs = np.array(np.array_split(test,110))

for i in range(10):
    tempMFList.append(librosa.feature.mfcc(y=tempDigs[i], sr=sr, n_mfcc=50))

for i in range(110):
    testMFList.append(librosa.feature.mfcc(y=testDigs[i], sr=sr, n_mfcc=50))

The actual digit of each test sample is testIndex mod 10

In [4]:
# sound of some template digits
print("Template digits")
for i in range(0,10,2):
    pr = "number: "+str(i)
    sound(tempDigs[i], sr, pr)

# sound of some test digits

print("Test digits")
for i in range(90,100,2):
    pr = "number: "+str(i)
    sound(testDigs[i], sr, pr)
Template digits
number: 0
number: 2
number: 4
number: 6
number: 8
Test digits
number: 90
number: 92
number: 94
number: 96
number: 98

2. The Algorithm

Since each digit (or any speech) can be spoken in vastly different ways, we want to ignore small, irrelevant differences that don't change the meaning of the utterance.

Thus, we can approach this problem by comparing the MFCC frames of the test audio against each template, finding the lowest-cost alignment (warping path) to determine our prediction.
In this algorithm, we'll use Bellman's principle of optimality for the pathfinding step.

First, we need the frame-to-frame distance matrix between the two MFCC representations. Using the cosine distance between frames $\mathbf{a}$ and $\mathbf{b}$:
$$D(i,j) = 1 - \frac{\sum_k a_k b_k}{\sqrt{\sum_k a_k^2}\,\sqrt{\sum_k b_k^2}}$$
where $D(i,j)$ is the distance between the $i$-th frame of the input and the $j$-th frame of the template.

In [5]:
def D_mat(a,b):
    D = np.zeros((len(a.T), len(b.T)))
    for i, matA in enumerate(a.T):
        for j, matB in enumerate(b.T):
            # get cosine distance between the two frames
            D[i,j] = dis.cosine(matA,matB)
    return D
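As a side note, the nested loop above can be replaced with SciPy's `cdist`, which computes all pairwise cosine distances in one call. A minimal sketch (assuming frames are columns, as in `D_mat`; `D_mat_fast` is a hypothetical name for this variant):

```python
import numpy as np
import scipy.spatial.distance as dis

def D_mat_fast(a, b):
    # rows of a.T / b.T are frames; cdist returns all pairwise cosine distances
    return dis.cdist(a.T, b.T, metric='cosine')

# quick check against the loop-based definition on random "MFCC" matrices
rng = np.random.default_rng(0)
a, b = rng.standard_normal((50, 8)), rng.standard_normal((50, 6))
D_loop = np.array([[dis.cosine(fa, fb) for fb in b.T] for fa in a.T])
assert np.allclose(D_mat_fast(a, b), D_loop)
```

The result is identical to the loop version but runs in vectorized C code, which matters once we classify hundreds of digits.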

Second, we compute the cost matrix $C$, where $C(i,j)$ is the cost of the cheapest path reaching $(i,j)$. As a constraint, a cell can only be entered from $(i,j-1)$, $(i-1,j-1)$, or $(i-1,j)$:
$$C(i,j) = D(i,j) + \min\{C(i,j-1),\; C(i-1,j-1),\; C(i-1,j)\}$$

In [6]:
def C_mat(D):
    C = D.copy()
    
    for i in range(1, C.shape[0]):
        for j in range(1, C.shape[1]):
            curr = C[i,j]
            W = C[i, j-1] + curr
            NW = C[i-1,j-1] + curr
            N = C[i-1,j] + curr
            # assign lowest value to C matrix
            C[i,j] = np.nanmin([W,NW,N])
    return C
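As a quick sanity check of the recurrence, consider a tiny hand-made distance matrix (values chosen purely for illustration): the first row and column keep their raw distances, and the one remaining cell adds the cheapest of its three predecessors.

```python
import numpy as np

D = np.array([[0., 2.],
              [3., 1.]])

C = D.copy()
# only C[1,1] gets updated: its predecessors are C[1,0]=3, C[0,0]=0, C[0,1]=2
C[1, 1] += min(C[1, 0], C[0, 0], C[0, 1])
print(C[1, 1])  # 1.0: the cheapest path reaches (1,1) via the diagonal
```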

Finally, we can form our classifier by using the lowest cost path to determine our prediction.

In [7]:
def classify(inputMFCC, templateMFCC, window = 40):
    retval = np.zeros(len(templateMFCC))
    
    for i, templateFrame in enumerate(templateMFCC):
        D = D_mat(inputMFCC, templateFrame)
        C = C_mat(D)
        
        # get minimum cost from both edges
        # only consider the last half
        opt = min(min(C[window:,-1]),min(C[-1,window:]))
        retval[i]=opt
    
    # return the minimum index
    return np.argmin(retval)

We are now done with our algorithm, and can now test it with our test sets.
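As a self-contained sanity check (separate from the recorded audio), we can run the same D/C/classify pipeline on synthetic "MFCC" matrices: a query that is a noisy copy of one template should match that template. This sketch re-states the three functions with a vectorized distance step, and shrinks the window because the synthetic clips have far fewer frames than the real recordings:

```python
import numpy as np
import scipy.spatial.distance as dis

def D_mat(a, b):
    # pairwise cosine distances between frames (columns) of a and b
    return dis.cdist(a.T, b.T, metric='cosine')

def C_mat(D):
    C = D.copy()
    for i in range(1, C.shape[0]):
        for j in range(1, C.shape[1]):
            C[i, j] += np.nanmin([C[i, j-1], C[i-1, j-1], C[i-1, j]])
    return C

def classify(inputMFCC, templateMFCC, window=5):
    costs = np.zeros(len(templateMFCC))
    for i, templateFrame in enumerate(templateMFCC):
        C = C_mat(D_mat(inputMFCC, templateFrame))
        # minimum cost along the last row/column, past the window
        costs[i] = min(min(C[window:, -1]), min(C[-1, window:]))
    return np.argmin(costs)

rng = np.random.default_rng(1)
templates = [rng.standard_normal((50, 20)) for _ in range(3)]
query = templates[1] + 0.01 * rng.standard_normal((50, 20))
assert classify(query, templates) == 1  # noisy copy matches template 1
```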

3. Digit Recognizer

Remember that the actual digit (0-9) is represented as testIndex mod 10.

In [8]:
# Test on digit 3
testIndex = 93

predicted = classify(testMFList[testIndex], tempMFList)
print("Actual: {}, Predicted: {}".format(testIndex%10,predicted))
sound(testDigs[testIndex], sr, "Actual digit")
sound(tempDigs[predicted], sr, "Template digit")
Actual: 3, Predicted: 3
Actual digit
Template digit

It seems that our algorithm works just fine. To determine its accuracy, we can run it on many different test sets.

I have 110 different sounds that I can test it against.

In [9]:
correct = np.zeros(10)
for i in trange(110, desc = 'Test set'):
    guessedval = classify(testMFList[i], tempMFList)
    actualIndex = i%10
    
    # count the guess if it matches the actual digit
    if guessedval == actualIndex:
        correct[actualIndex] += 1

Summary

We can now see our accuracy for each digit:

In [10]:
## Data Summary
print("Data Summary:\n")
totalCorrectDigit = int(np.sum(correct))
print("Total Accuracy: {}%, Correct Guesses: {}, False Guesses: {}\n".format(totalCorrectDigit/110*100, totalCorrectDigit, 110-totalCorrectDigit))

for idx, c in enumerate(correct):
    print("Digit {} Accuracy: {}%".format(idx, c/11*100) )
Data Summary:

Total Accuracy: 99.0909090909091%, Correct Guesses: 109, False Guesses: 1

Digit 0 Accuracy: 100.0%
Digit 1 Accuracy: 100.0%
Digit 2 Accuracy: 100.0%
Digit 3 Accuracy: 100.0%
Digit 4 Accuracy: 100.0%
Digit 5 Accuracy: 100.0%
Digit 6 Accuracy: 100.0%
Digit 7 Accuracy: 100.0%
Digit 8 Accuracy: 90.9090909090909%
Digit 9 Accuracy: 100.0%

4. Voice-driven dialler

We can now implement what we have made for something that can be used in our daily lives.

In this case, we can make a voice-driven dialler: we set up a number for each of our friends (by saying the friend's name followed by their phone number), and can then call them just by saying their name.

Let's start by importing our audio files and parsing them (similar to section 1).

Audio Import and Parsing

In [11]:
# sr = 44100
sr = wavfile.read("./voice_dialler/input.wav")[0]

# take L channel
tempVD = np.array(wavfile.read("./voice_dialler/input.wav")[1][:,0], dtype=float)
testVD = np.array(wavfile.read("./voice_dialler/names.wav")[1][:,0], dtype=float)
In [12]:
print("Input data:")
sound(tempVD, sr, "Setup audio")
sound(testVD, sr, "Test names")
Input data:
Setup audio
Test names

Contact List:

Names Phone Number
Furkan 1379
Simon 5240
Mohamed 6683
Edward 7134
Amir 9523
In [13]:
recipientNum = 5
phoneDigitsAmt = 4
names = ["Furkan", "Simon","Mohamed","Edward","Amir"]

# parse setup audio into 10 chunks (5 names and 5 phone numbers)
tempMFListVD = []
tempWAV = np.array(np.array_split(tempVD,10))

# parse test audio into 10 spoken-name chunks
testMFListVD = []
testWAV = np.array(np.array_split(testVD,10))
    
for i in range(10):
    testMFListVD.append(librosa.feature.mfcc(y=testWAV[i], sr=sr, n_mfcc=50))
In [14]:
# sound of the template chunks
print("template chunks")
for i in range(10):
    pr = "chunk: "+str(i)
    sound(tempWAV[i], sr, pr)

# sound of the test name chunks
print("test names")
for i in range(10):
    pr = "chunk: "+str(i)
    sound(testWAV[i], sr, pr)
template chunks
chunk: 0
chunk: 1
chunk: 2
chunk: 3
chunk: 4
chunk: 5
chunk: 6
chunk: 7
chunk: 8
chunk: 9
test names
chunk: 0
chunk: 1
chunk: 2
chunk: 3
chunk: 4
chunk: 5
chunk: 6
chunk: 7
chunk: 8
chunk: 9

Phone Numbers and Names

We can now convert the phone numbers from WAV audio file into string.

In [15]:
tempNames = []
phoneNumber = []

for i in range(10):
    if i % 2 == 0: tempNames.append(np.array_split(tempWAV[i],4)[0])
    else: phoneNumber.append(tempWAV[i])

        
# Find each audio files' MFCC representation using librosa
tempNamesMF = []
for i in range(recipientNum):
    tempNamesMF.append(librosa.feature.mfcc(y=tempNames[i], sr=sr, n_mfcc=50))
    
phoneDigs = []
for i in range(recipientNum):
    phoneDigs.append(np.array_split(phoneNumber[i],phoneDigitsAmt))
In [16]:
# Convert phone number to string

phoneNumArr = np.zeros((recipientNum,phoneDigitsAmt))

for i in trange(recipientNum, desc='recipients'):
    for j in tqdm(range(phoneDigitsAmt), desc='digits'):
        # get phone number digits
        curDigMFCC = librosa.feature.mfcc(y=phoneDigs[i][j], sr=sr, n_mfcc=50)
        curDigit = classify(curDigMFCC, tempMFList)
        phoneNumArr[i][j]=curDigit

        
phoneNumStr = []
for i in range(recipientNum):
    curStr = ""
    for j in range(phoneDigitsAmt):
        curStr += str(int(phoneNumArr[i][j]))
    phoneNumStr.append(curStr)
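The digit-matrix-to-string conversion above can also be written as a one-line join per recipient; an equivalent sketch on two illustrative rows:

```python
import numpy as np

# illustrative digit rows (Furkan's and Simon's numbers from the contact list)
phoneNumArr = np.array([[1., 3., 7., 9.],
                        [5., 2., 4., 0.]])
phoneNumStr = ["".join(str(int(d)) for d in row) for row in phoneNumArr]
print(phoneNumStr)  # ['1379', '5240']
```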





In [17]:
print(phoneNumStr)
['1379', '5240', '2783', '7134', '9523']

The accuracy of our digit recognizer here is ~90%. We can see that at index 2, Mohamed's phone number was detected as 2783, even though it's supposed to be 6683.

Testing the feature

We can now test the feature to see if our algorithm correctly calls the spoken name.

In [18]:
# Testing to call "Mohamed", phone number 6683
testIdx = 2

title = "input: "+ names[testIdx%5]
sound(testWAV[testIdx%5], sr, title)
guessedNameIdx = classify(testMFListVD[testIdx], tempNamesMF)

print("matches with:")
title = "template: "+ names[guessedNameIdx]
sound(tempNames[guessedNameIdx], sr, title)
print("Dialling {}, with phone number: {}".format(names[guessedNameIdx], phoneNumStr[guessedNameIdx]))
input: Mohamed
matches with:
template: Mohamed
Dialling Mohamed, with phone number: 2783
In [19]:
# Testing to call "Furkan", phone number 1379
testIdx = 0

title = "input: "+ names[testIdx%5]
sound(testWAV[testIdx%5], sr, title)
guessedNameIdx = classify(testMFListVD[testIdx], tempNamesMF)

print("matches with:")
title = "template: "+ names[guessedNameIdx]
sound(tempNames[guessedNameIdx], sr, title)
print("Dialling {}, with phone number: {}".format(names[guessedNameIdx], phoneNumStr[guessedNameIdx]))
input: Furkan
matches with:
template: Furkan
Dialling Furkan, with phone number: 1379

Determining the overall accuracy

Now that we have tested the feature, we can determine the overall accuracy of the algorithm with some test sets.
In my case, I have 10 different recordings covering the 5 given names (2 for each).

The actual name index is represented as testIndex mod 5.

In [20]:
correctVD = np.zeros(5)

for i in trange(10, desc='Test Set'):
    guessedNameIdx = classify(testMFListVD[i], tempNamesMF)
    actualIndex = i%5
    # count the guess if it matches the actual name
    if guessedNameIdx == actualIndex:
        correctVD[actualIndex] += 1

The summary of the accuracy for each case is listed below:

In [21]:
## Data Summary
totalVDCorrect = int(np.sum(correctVD))
print("Data Summary:\n")
print("Accuracy: {}%, Correct Guesses: {}, False Guesses: {}\n".format(totalVDCorrect/10*100, totalVDCorrect, 10-totalVDCorrect))

for idx, c in enumerate(correctVD):
    print("Name {} Accuracy: {}%".format(names[idx%5], c/2*100) )
Data Summary:

Accuracy: 100.0%, Correct Guesses: 10, False Guesses: 0

Name Furkan Accuracy: 100.0%
Name Simon Accuracy: 100.0%
Name Mohamed Accuracy: 100.0%
Name Edward Accuracy: 100.0%
Name Amir Accuracy: 100.0%

Based on our small test set, we can confirm that our voice-driven dialler works as expected.

Conclusion

In conclusion, the algorithm is certainly not optimized for large template sets: it iterates over every template and computes a full frame-by-frame distance matrix for each, so a single comparison between an N-frame input and an M-frame template costs O(NM), and classifying against T templates costs O(TNM). Runtime therefore grows quickly on bigger datasets.

The high accuracy may also be partly explained by the test sets being fairly similar to the templates (I recorded both under similar conditions).